When chaining multiple operations, it is worth thinking about which ones to execute first in order to optimize the chain: filter steps should run as early as possible.
Whenever possible, filter the data where it lives (for example, in BigQuery). If that is not an option, filter your DataFrame as soon as possible so that you only work with the data you need, thus optimizing the operations that follow.
⚙️ Precondition: Getting Data
import pandas as pd
import numpy as np
import time

titanic_data = pd.read_csv('test.csv')
passenger_id = 895
titanic_data
     PassengerId  Pclass                                          Name     Sex   Age  SibSp  Parch              Ticket      Fare Cabin Embarked
0            892       3                              Kelly, Mr. James    male  34.5      0      0              330911    7.8292   NaN        Q
1            893       3              Wilkes, Mrs. James (Ellen Needs)  female  47.0      1      0              363272    7.0000   NaN        S
2            894       2                     Myles, Mr. Thomas Francis    male  62.0      0      0              240276    9.6875   NaN        Q
3            895       3                              Wirz, Mr. Albert    male  27.0      0      0              315154    8.6625   NaN        S
4            896       3  Hirvonen, Mrs. Alexander (Helga E Lindqvist)  female  22.0      1      1             3101298   12.2875   NaN        S
..           ...     ...                                           ...     ...   ...    ...    ...                 ...       ...   ...      ...
413         1305       3                            Spector, Mr. Woolf    male   NaN      0      0           A.5. 3236    8.0500   NaN        S
414         1306       1                  Oliva y Ocana, Dona. Fermina  female  39.0      0      0            PC 17758  108.9000  C105        C
415         1307       3                  Saether, Mr. Simon Sivertsen    male  38.5      0      0  SOTON/O.Q. 3101262    7.2500   NaN        S
416         1308       3                           Ware, Mr. Frederick    male   NaN      0      0              359309    8.0500   NaN        S
417         1309       3                      Peter, Master. Michael J    male   NaN      1      1                2668   22.3583   NaN        C

418 rows × 11 columns
🔧 Filtering Methods
Pandas provides many methods for data selection; some of them are reviewed below.
1. Index []
Indexing with [] is the simplest way of filtering: the condition expression creates a Boolean Series, which is then used to select rows from the DataFrame.
The query() method is a cleaner alternative: it supports much more complicated conditional expressions and can be faster than [].
✅ The traditional [] method is faster for smaller arrays.
✅ The benefit of eval() and query() is mainly the saved memory, plus the sometimes cleaner syntax they offer.
✅ The advantages of eval() and query() show up on very large datasets.
✅ eval() or query() is recommended when you work with a lot of data.
👀 Switching between eval(), query(), and [] does not by itself guarantee a performance improvement, because performance depends on several variables at once.
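The three filtering styles above can be sketched on a small stand-in DataFrame (the column names mirror the Titanic data; the values are made up for illustration):

```python
import pandas as pd

# Toy DataFrame standing in for the Titanic data
df = pd.DataFrame({"Pclass": [3, 1, 2, 3], "Fare": [7.8, 108.9, 9.7, 8.0]})

fare_limit = 9.0

# 1. Boolean-mask indexing with []
by_index = df[df["Fare"] < fare_limit]

# 2. query() with a conditional expression (@ references a Python variable)
by_query = df.query("Fare < @fare_limit")

# 3. eval() builds the Boolean mask, [] applies it
by_eval = df[df.eval("Fare < @fare_limit")]

# All three return the same rows
print(by_index.equals(by_query) and by_index.equals(by_eval))  # True
```

All three produce identical results here; which one is fastest depends on the data size, as the notes above point out.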
🔍 Recommendations for Faster Lookup
Here are some strategies and properties useful for faster lookup.
# Variables
position = 3
column = "Ticket"
Index Optimization
Pandas has optimized operations based on indices, so it is recommended to set an index on your DataFrames to allow for faster lookup.
set_index() sets the DataFrame index (row labels) using one or more existing columns or arrays (of the correct length). The index can replace the existing index or expand on it.
# 1. We identify PassengerId as a candidate column to use as the index
titanic_data_indexed = titanic_data.set_index("PassengerId", drop=False, inplace=False)

start = time.time()

# 2. Filtering the dataframe using the PassengerId column
passengers_by_ticket_eval = titanic_data_indexed[titanic_data_indexed.eval('PassengerId == @passenger_id')]

# 3. Duration time
print(f"time : {(time.time() - start) * 1000} ms")
time : 4.584312438964844 ms
Lookup a Single Value
When we need to retrieve a single value from a DataFrame, it is recommended to use .at[] because it is faster than .loc[].
.loc
The .loc property of the DataFrame object allows the return of specified rows and/or columns from that DataFrame.
df.loc[rows,columns]
Note: .loc is not a method, it is a property indexed via square brackets.
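A minimal sketch of the difference, on a toy frame indexed by PassengerId (the values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Toy frame indexed by PassengerId
df = pd.DataFrame(
    {"Name": ["Kelly, Mr. James", "Wirz, Mr. Albert"], "Fare": [7.8292, 8.6625]},
    index=[892, 895],
)

# .loc can return rows, columns, or sub-frames...
row = df.loc[895]               # whole row as a Series
fare_loc = df.loc[895, "Fare"]  # single value

# ...but for a single scalar, .at is the faster access path
fare_at = df.at[895, "Fare"]

print(fare_loc == fare_at)  # True
```

.at[] only ever addresses one scalar at a time, which is exactly why it can skip the machinery that makes .loc[] flexible.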
Vectorization
Vectorization is the process of executing operations on entire arrays at once, similar to NumPy.
It is recommended to avoid for loops when working with DataFrames, because row-by-row read and write operations are expensive. When looping is unavoidable, use native NumPy operations, or .map() for simple element-wise operations.
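The contrast can be sketched like this (toy Fare values, chosen to mirror the dataset):

```python
import pandas as pd

df = pd.DataFrame({"Fare": [7.8292, 7.0, 9.6875, 8.6625]})

# Slow: explicit Python loop over the values
doubled_loop = [fare * 2 for fare in df["Fare"]]

# Fast: vectorized column arithmetic (NumPy under the hood)
df["FareDouble"] = df["Fare"] * 2

# .map() is a reasonable fallback for simple element-wise logic
df["FareRounded"] = df["Fare"].map(round)

print(df["FareDouble"].tolist() == doubled_loop)  # True
```

On four rows the difference is invisible; on millions of rows the vectorized version avoids one Python-level function call per element.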
Verify Memory Usage
Every time you work with a DataFrame, verify its memory usage. To do this you can use the functions info() or memory_usage().
memory_usage() returns the memory usage of each column in bytes.
titanic_data.memory_usage(index=False, deep=True)
PassengerId 3344
Pclass 3344
Name 35314
Sex 25802
Age 3344
SibSp 3344
Parch 3344
Ticket 26700
Fare 3344
Cabin 16022
Embarked 24244
dtype: int64
Memory Optimization
When a dataset is read using a Pandas read function such as read_csv or read_excel, Pandas decides the data type of each column and loads it into RAM. Normally, integer values are assigned int64, float values are assigned float64, and string values are stored as objects. The problem is that an int64 takes up more memory than an int8 (8 times more).
The idea is to downgrade the data type: review the max and min values of a column and choose the smallest data type that can hold them.
# 2. Show memory usage for the PassengerId column: 3344 bytes
titanic_data.memory_usage(index=False, deep=True)
PassengerId 3344
Pclass 3344
Name 35314
Sex 25802
Age 3344
SibSp 3344
Parch 3344
Ticket 26700
Fare 3344
Cabin 16022
Embarked 24244
dtype: int64
# 3. Get the Max and Min value for the PassengerId column and decide the best datatype:
min_value = titanic_data["PassengerId"].min()
max_value = titanic_data["PassengerId"].max()
print(f"Min Value: {min_value} - Max Value: {max_value}")
Min Value: 892 - Max Value: 1309
Analysis
▶ The PassengerId column has values between 892 and 1309.
▶ This range does not contain negative numbers.
▶ Pandas assigned int64 to this column, but the range of int64 is Integer (-9223372036854775808 to 9223372036854775807):
🔴 It’s a wide range.
🔴 Allows negative numbers.
🟢 We can review the range of the different data types in the above table and choose the best range.
🟢 So, we finally decide to use uint16. But why is it the best option?
✔ range of uint16 is: Unsigned integer (0 to 65535)
✔ The range is enough to contain the values of the PassengerId column.
✔ We only need positive numbers.
# 4. Set the PassengerId column to the uint16 datatype:
titanic_data["PassengerId"] = titanic_data["PassengerId"].astype("uint16")
# 5. Show memory usage for the PassengerId column again:
titanic_data.memory_usage(index=False, deep=True)
# PassengerId has been reduced from 3344 bytes to 836 bytes
PassengerId 836
Pclass 3344
Name 35314
Sex 25802
Age 3344
SibSp 3344
Parch 3344
Ticket 26700
Fare 3344
Cabin 16022
Embarked 24244
dtype: int64
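The manual min/max check above can also be delegated to Pandas: pd.to_numeric() with the downcast argument picks the smallest fitting dtype automatically. A minimal sketch, using a synthetic range that mirrors the PassengerId values:

```python
import pandas as pd

# Synthetic integer column spanning 892..1309, like PassengerId
s = pd.Series(range(892, 1310))
print(s.dtype)  # int64 by default on most platforms

# pd.to_numeric picks the smallest unsigned dtype that fits the values
s_small = pd.to_numeric(s, downcast="unsigned")
print(s_small.dtype)  # uint16
```

This is handy when cleaning many columns at once, though an explicit astype() call, as above, documents the chosen dtype more clearly.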